Mining Statistically Significant Substrings using the Chi-Square Statistic
Authors
Abstract
The problem of identifying statistically significant patterns in a sequence of data arises in many domains, such as intrusion detection systems, financial models, web-click records, automated monitoring systems, computational biology, cryptology, and text analysis. An observed pattern of events is deemed statistically significant if it is unlikely to have occurred due to randomness or chance alone. We use the chi-square statistic as a quantitative measure of statistical significance. Given a string of characters generated from a memoryless Bernoulli model, the problem is to identify the substring for which the empirical distribution of single letters deviates the most from the distribution expected from the generative Bernoulli model. This deviation is captured using the chi-square measure. The most significant substring (MSS) of a string is thus defined as the substring having the highest chi-square value. To date, to the best of our knowledge, no algorithm finds the MSS in better than $O(n^2)$ time, where $n$ denotes the length of the string. In this paper, we propose an algorithm to find the most significant substring whose running time is $O(n^{3/2})$ with high probability. We also study some variants of this problem, such as finding the top-t set, finding all substrings having a chi-square value greater than a fixed threshold, and finding the MSS among substrings longer than a given length. We experimentally demonstrate the asymptotic behavior of the MSS on varying the string size and alphabet size. We also describe some applications of our algorithm in cryptology and on real-world data from finance and sports. Finally, we compare our technique with the existing heuristics for finding the MSS.

1. MOTIVATION

Statistical significance is used to ascertain whether the outcome of a given experiment can be ascribed to some extraneous factors or is solely due to chance. Given a string composed of characters from an alphabet $\Sigma = \{a_1, a_2, \ldots, a_k\}$ of constant size $k$, the null hypothesis assumes that the letters of the string are generated from a memoryless Bernoulli model. Each letter of the string is drawn randomly and independently from a fixed multinomial probability distribution $P = \{p_1, p_2, \ldots, p_k\}$, where $p_i$ denotes the probability of occurrence of character $a_i$ in the alphabet ($\sum_{i=1}^{k} p_i = 1$). The objective is to find the connected subregion of the string (i.e., a substring) for which the empirical distribution of single letters deviates the most from the distribution given by the Bernoulli model.

Detection of statistically relevant patterns in a sequence of events has drawn significant interest in the computer science community and has been applied in many fields, including molecular biology, cryptology, telecommunications, intrusion detection, automated monitoring, text mining, and financial modeling. The applications in computational biology include assessing the over-representation of exceptional patterns [7] and studying the mutation characteristics in the protein sequence of an organism by identifying sudden changes in its mutation rates [18]. Different studies suggest detecting intrusions in various information systems by searching for hidden patterns that are unlikely to occur [26, 27]. In telecommunications, it has been applied to detect periods of heavy traffic [13]. It has also been used in analyzing financial time series to reveal hidden temporal patterns that are characteristic and predictive of time series events [22] and to predict stock prices [17].

Whether a substring qualifies as statistically significant depends on the statistical model used to calculate the deviation of the empirical distribution of single letters from its expected distribution. The exact formulation of statistical significance depends on the metric used; the p-value and the z-score [23, 25] are the two most commonly used (some others are reviewed in [10, 24]).
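The objective stated above can be made concrete with a brute-force sketch. It assumes the standard Pearson form of the chi-square statistic, $X^2 = \sum_{i=1}^{k} (Y_i - \ell p_i)^2 / (\ell p_i)$, where $Y_i$ is the count of $a_i$ in a substring of length $\ell$; the excerpt does not spell this formula out, so treat it as an assumption. The function names `chi_square` and `brute_force_mss` are hypothetical, and the quadratic enumeration below is only a baseline for illustration, not the faster algorithm proposed in the paper.

```python
def chi_square(counts, length, probs):
    """Pearson chi-square of observed letter counts against expected counts length * p.

    Assumed form of the statistic: sum over letters of (Y - l*p)^2 / (l*p).
    """
    return sum((counts[a] - length * p) ** 2 / (length * p) for a, p in probs.items())


def brute_force_mss(s, probs):
    """Return (chi_square_value, start, end) for the highest-scoring substring s[start:end].

    Illustrative baseline that scores every substring; quadratic in len(s)
    for a constant-size alphabet.
    """
    best = (-1.0, 0, 0)
    for start in range(len(s)):
        # Letter counts of s[start:end], grown one character at a time.
        counts = dict.fromkeys(probs, 0)
        for end in range(start + 1, len(s) + 1):
            counts[s[end - 1]] += 1
            value = chi_square(counts, end - start, probs)
            if value > best[0]:
                best = (value, start, end)
    return best


# Hypothetical input: a fair two-letter model and a string with a run of seven a's.
text = "ababab" + "a" * 7 + "b"
value, start, end = brute_force_mss(text, {"a": 0.5, "b": 0.5})
print(value, start, end, text[start:end])
```

Under a fair two-letter model, a run of $\ell$ identical letters scores exactly $\ell$, which is why the run of seven a's wins in this toy input.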
Research indicates that in most practical cases, the p-value provides more precise and accurate results than the z-score [7]. The p-value is defined as the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming the null hypothesis to be true. For example, in an experiment to determine whether a coin is fair, suppose it turns up heads on 19 out of 20 tosses. Assuming the null hypothesis (i.e., that the coin is fair) to be true, the p-value is equal to the probability of observing 19 or more heads in 20 flips of a fair coin: $\text{p-value} = \Pr(19H) + \Pr(20H) = \binom{20}{19} \ldots$
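The coin example can be checked numerically. The sketch below evaluates the upper tail of a binomial distribution using only the Python standard library; the helper name `binomial_upper_tail` is hypothetical, and the one-sided tail is an assumption that matches the "19 or more heads" wording of the example.

```python
from math import comb


def binomial_upper_tail(n, k, p=0.5):
    """Probability of at least k successes in n independent trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 19 or more heads in 20 flips of a fair coin.
p_value = binomial_upper_tail(20, 19)
print(p_value)
```

For the fair coin this equals $(20 + 1) / 2^{20}$, roughly $2 \times 10^{-5}$, so 19 heads in 20 tosses would be very unlikely under the null hypothesis.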
Similar resources
Mining Statistically Significant Patterns using the Chi-Square Statistic
Statistical significance is used to ascertain whether the outcome of a given experiment can be ascribed to some extraneous factors or is solely due to chance. An observed pattern of events is deemed to be statistically significant if it is unlikely to have occurred due to randomness or chance alone. In the thesis, we study the problem of identifying the statistically relevant patterns in string...
Mining Statistically Significant Substrings Based on the Chi-Square Measure
Given the vast reservoirs of data stored worldwide, efficient mining of data from a large information store has emerged as a great challenge. Many databases like that of intrusion detection systems, web-click records, player statistics, texts, proteins etc., store strings or sequences. Searching for an unusual pattern within such long strings of data has emerged as a requirement for diverse app...
Most Significant Substring Mining Based on Chi-square Measure
Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases such as intrusion detection systems, player statistics, texts, proteins, etc. has emerged as a great challenge. Searching for an unusual pattern within long strings of data has emerged as a requirement for diverse applications. Given a string, the problem then is to identify the substrings that dif...
Network Anomalies Detection Using Statistical Technique: A Chi-Square Approach
An intrusion detection system, used to detect suspicious activities, is one form of defense. However, the sheer size of the network logs makes human log analysis intractable. Furthermore, traditional intrusion detection methods based on pattern matching techniques cannot cope with the need for faster speed to manually update those patterns. Anomaly detection is used as a part of the intrusion det...
Chi-squared computation for association rules: preliminary results
Chi squared analysis is useful in determining the statistical significance level of association rules. We show that the chi squared statistic of a rule may be computed directly from the values of confidence, support, and lift (interest) of the rule in question. Our results facilitate pruning of rule sets obtained using standard association rule mining techniques, allow identification of statist...
Journal title:
- PVLDB
Volume 5, Issue
Pages -
Publication date 2012